Medical Dataset - Segmenting Patients¶

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df=pd.read_csv('patient_dataset.csv', index_col=0)
df.head()
Out[2]:
age gender chest pain type blood pressure cholesterol max heart rate exercise angina plasma glucose skin_thickness insulin bmi diabetes_pedigree hypertension heart_disease Residence_type smoking_status triage
0 40.0 1.0 2.0 140.0 294.0 172.0 0.0 108.0 43.0 92.0 19.0 0.467386 0.0 0.0 Urban never smoked yellow
1 49.0 0.0 3.0 160.0 180.0 156.0 0.0 75.0 47.0 90.0 18.0 0.467386 0.0 0.0 Urban never smoked orange
2 37.0 1.0 2.0 130.0 294.0 156.0 0.0 98.0 53.0 102.0 23.0 0.467386 0.0 0.0 Urban never smoked yellow
3 48.0 0.0 4.0 138.0 214.0 156.0 1.0 72.0 51.0 118.0 18.0 0.467386 0.0 0.0 Urban never smoked orange
4 54.0 1.0 3.0 150.0 195.0 156.0 0.0 108.0 90.0 83.0 21.0 0.467386 0.0 0.0 Urban never smoked yellow

Defining the Problem Statement and Performing Exploratory Data Analysis¶

Triage refers to the sorting of injured or sick people according to their need for emergency medical attention. It is a method of determining priority for who gets care first.

Based on patient symptoms, the goals of triage are to:

    Identify patients needing immediate resuscitation; 
    Assign patients to a predesignated patient care area, thereby prioritizing their care; 
    Initiate diagnostic/therapeutic measures as appropriate. 
  • The dataset includes demographic, lifestyle, and health-related features, such as age, gender, cholesterol levels, blood pressure, BMI, diabetes history, and smoking status.

  • Apply unsupervised learning techniques such as K-Means, Gaussian Mixture Models, and Hierarchical Clustering to segment the data into meaningful clusters.

      The study will explore whether these clusters reveal distinct patient groups that could be useful for medical research, risk stratification, or personalized treatment plans. 

Observations on the data types of all the attributes¶

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6962 entries, 0 to 5109
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   age                6962 non-null   float64
 1   gender             6961 non-null   float64
 2   chest pain type    6962 non-null   float64
 3   blood pressure     6962 non-null   float64
 4   cholesterol        6962 non-null   float64
 5   max heart rate     6962 non-null   float64
 6   exercise angina    6962 non-null   float64
 7   plasma glucose     6962 non-null   float64
 8   skin_thickness     6962 non-null   float64
 9   insulin            6962 non-null   float64
 10  bmi                6962 non-null   float64
 11  diabetes_pedigree  6962 non-null   float64
 12  hypertension       6962 non-null   float64
 13  heart_disease      6962 non-null   float64
 14  Residence_type     6962 non-null   object 
 15  smoking_status     6962 non-null   object 
 16  triage             6552 non-null   object 
dtypes: float64(14), object(3)
memory usage: 979.0+ KB

Missing value check¶

In [4]:
print('Missing Values in the dataset ')
df.isna().sum()
Missing Values in the dataset 
Out[4]:
age                    0
gender                 1
chest pain type        0
blood pressure         0
cholesterol            0
max heart rate         0
exercise angina        0
plasma glucose         0
skin_thickness         0
insulin                0
bmi                    0
diabetes_pedigree      0
hypertension           0
heart_disease          0
Residence_type         0
smoking_status         0
triage               410
dtype: int64
In [5]:
print("Total Missing Values ")
df.isna().sum().sum()
Total Missing Values 
Out[5]:
411

Outlier detection¶

In [6]:
# Extracting Numerical data from the pool
numeric_data=df.select_dtypes('number')
numeric_data.head(5)
Out[6]:
age gender chest pain type blood pressure cholesterol max heart rate exercise angina plasma glucose skin_thickness insulin bmi diabetes_pedigree hypertension heart_disease
0 40.0 1.0 2.0 140.0 294.0 172.0 0.0 108.0 43.0 92.0 19.0 0.467386 0.0 0.0
1 49.0 0.0 3.0 160.0 180.0 156.0 0.0 75.0 47.0 90.0 18.0 0.467386 0.0 0.0
2 37.0 1.0 2.0 130.0 294.0 156.0 0.0 98.0 53.0 102.0 23.0 0.467386 0.0 0.0
3 48.0 0.0 4.0 138.0 214.0 156.0 1.0 72.0 51.0 118.0 18.0 0.467386 0.0 0.0
4 54.0 1.0 3.0 150.0 195.0 156.0 0.0 108.0 90.0 83.0 21.0 0.467386 0.0 0.0
In [7]:
plt.figure(figsize=(15, 15), layout="constrained", frameon=True)
i=1
for col in numeric_data.columns:
    plt.subplot(4, 4, i)
    sns.boxplot(df[col], color="#e63946")
    plt.title(col)
    i += 1
plt.show()
  • From the boxplots above, the features with potential outliers are:
    • cholesterol
    • plasma glucose
    • insulin
    • bmi
    • diabetes_pedigree
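The boxplot whiskers correspond to the usual 1.5×IQR rule, which can also be counted directly. A minimal sketch, shown on synthetic right-skewed data standing in for the notebook's `df` columns (`iqr_outlier_count` is a helper name introduced here):

```python
import numpy as np
import pandas as pd

def iqr_outlier_count(s, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - k * iqr) | (s > q3 + k * iqr)).sum())

# Demo on synthetic right-skewed data; the notebook would call this on df[col]
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=1000))
n_out = iqr_outlier_count(skewed)
```

On a skewed column such as `insulin`, this count makes the "potential outlier" judgement above concrete instead of purely visual.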
In [8]:
outlier_features=df[['cholesterol', 'plasma glucose', 'insulin', 'bmi', 'diabetes_pedigree']]
In [9]:
plt.figure(figsize=(15, 8), layout="constrained", frameon=True)
i = 1
for col in outlier_features:
    plt.subplot(2, 3, i)
    sns.histplot(df[col], kde=True, color="#2a9d8f")
    plt.title(col)
    i += 1
plt.show()

Relationship between important variables¶

In [10]:
plt.figure(figsize=(15, 6))
sns.lineplot(
    x=df["hypertension"],
    y=df["max heart rate"],
    hue=df["triage"],
    errorbar=None,
    hue_order=["yellow", "orange", "green", "red"],
)
plt.title("Max Heart Rate vs. Hypertension")
plt.show()
In [11]:
plt.figure(figsize=(15, 6))
sns.barplot(
    x=df["exercise angina"],
    y=df["max heart rate"],
    hue=df["triage"],
    hue_order=["yellow", "orange", "green", "red"],
)
plt.title("Max Heart Rate vs. Exercise Angina")
plt.show()
In [12]:
plt.figure(figsize=(15, 6))
sns.violinplot(
    x=df["heart_disease"], y=df["age"], palette="coolwarm", hue=df["heart_disease"]
)
plt.title("Age vs. Heart Disease")
plt.show()
In [13]:
plt.figure(figsize=(15, 8))
sns.lineplot(y=df["bmi"], x=df["age"], hue=df["smoking_status"])
plt.show()
In [14]:
plt.figure(figsize=(15, 8))
sns.barplot(x=df["chest pain type"], y=df["age"], hue=df["smoking_status"])
plt.title("Chest Pain Type vs. Age on Smoking Status")
plt.show()
In [15]:
plt.figure(figsize=(20,12))

plt.subplot(2,2,1)
sns.histplot(df["age"], kde=True, color="#cdb4db")
plt.title("Age Distribution")

plt.subplot(2, 2, 2)
sns.histplot(df["cholesterol"], kde=True, color="#219ebc")
plt.title("Cholesterol Distribution")

plt.subplot(2, 2, 3)
sns.histplot(df["max heart rate"], kde=True, color="#9b5de5")
plt.title("Max Heart Rate Distribution")


plt.subplot(2, 2, 4)
sns.histplot(df["blood pressure"], kde=True, color="#3a5a40")
plt.title("Blood Pressure Distribution")

plt.show()



Data Preprocessing¶

Imputation¶

In [16]:
# Handling Missing values 
missing_values_data=df.isna().sum()[df.isna().sum()>0]

sns.heatmap(df.isnull(), cbar=False, cmap="viridis", yticklabels=False)
plt.title("Heatmap of Missing Values")
plt.show()
In [17]:
#Calculating the percentage of missing values
missing_percentage = round((missing_values_data / len(df)) * 100, 2)

missing_data_summary = pd.DataFrame(
    {
        "Missing Values": missing_values_data[missing_values_data > 0],
        "Percentage (%)": missing_percentage[missing_values_data > 0],
    }
).sort_values(by="Percentage (%)", ascending=False)

print(missing_data_summary)
        Missing Values  Percentage (%)
triage             410            5.89
gender               1            0.01

In [18]:
# Handling triage
# df['triage'].value_counts()
In [19]:
# Null values filled with mode
# df["triage"] = df["triage"].fillna("yellow")

In [20]:
df["gender"].value_counts()
Out[20]:
gender
1.0    3703
0.0    3258
Name: count, dtype: int64
In [21]:
# Fill the single missing gender value (note: the mode above is 1.0, not 0.0)
df["gender"] = df["gender"].fillna(0.0)
In [22]:
df.isnull().sum()
Out[22]:
age                    0
gender                 0
chest pain type        0
blood pressure         0
cholesterol            0
max heart rate         0
exercise angina        0
plasma glucose         0
skin_thickness         0
insulin                0
bmi                    0
diabetes_pedigree      0
hypertension           0
heart_disease          0
Residence_type         0
smoking_status         0
triage               410
dtype: int64


Outlier Treatment¶

In [23]:
plt.figure(figsize=(15, 8), layout="constrained", frameon=True)
i = 1
for col in outlier_features:
    plt.subplot(2, 3, i)
    sns.histplot(df[col], kde=True, color="#2a9d8f")
    plt.title(col)
    i += 1
plt.show()

Transform the Data to Reduce the Impact of Outliers

  • Log transformation (best for right-skewed data)
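A quick way to confirm the transform helps is to compare skewness before and after `np.log1p`. A sketch on synthetic lognormal data (illustrative, not the notebook's columns):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
raw = rng.lognormal(mean=4.0, sigma=1.0, size=5000)  # strongly right-skewed

before = skew(raw)           # large positive skew
after = skew(np.log1p(raw))  # close to symmetric
```

`log1p` (i.e. log(1 + x)) is preferred over a plain log because it is defined at zero, which matters for columns like `insulin` that may contain zeros.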
In [24]:
outlier_features.columns
Out[24]:
Index(['cholesterol', 'plasma glucose', 'insulin', 'bmi', 'diabetes_pedigree'], dtype='object')
In [25]:
# cholesterol
df["cholesterol"] = np.log1p(df["cholesterol"])

In [26]:
# plasma glucose
df["plasma glucose"] = np.log1p(df["plasma glucose"])

In [27]:
# insulin
df["insulin"] = np.log1p(df["insulin"])

In [28]:
# bmi
df["bmi"] = np.log1p(df["bmi"])

In [29]:
# diabetes_pedigree
df["diabetes_pedigree"] = np.log1p(df["diabetes_pedigree"])
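The five cells above repeat the same `np.log1p` call once per column; an equivalent loop, sketched on a tiny synthetic frame (`demo` stands in for `df`):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "cholesterol": [294.0, 180.0, 214.0],
    "insulin": [92.0, 90.0, 118.0],
})

# Apply log1p to every listed column in one pass
for col in ["cholesterol", "insulin"]:
    demo[col] = np.log1p(demo[col])
```

Note that rerunning such a cell applies the transform twice; it should execute only once per kernel session, a caveat that applies equally to the five individual cells above.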

In [30]:
plt.figure(figsize=(15, 8), layout="constrained", frameon=True)
i = 1
for col in outlier_features:
    plt.subplot(2, 3, i)
    sns.histplot(df[col], kde=True, color="#2a9d8f")
    plt.title(col)
    i += 1
plt.show()


Encoding all the categorical attributes¶

In [31]:
categorical_data=df.select_dtypes("object")
categorical_data.head(5)
Out[31]:
Residence_type smoking_status triage
0 Urban never smoked yellow
1 Urban never smoked orange
2 Urban never smoked yellow
3 Urban never smoked orange
4 Urban never smoked yellow
In [32]:
# Encoding residence_type

df['Residence_type'].value_counts().index
Out[32]:
Index(['Urban', 'Rural'], dtype='object', name='Residence_type')
In [33]:
Residence_type_map = {"Urban": 0, "Rural": 1}

df['Residence_type']=df['Residence_type'].map(Residence_type_map)

df['Residence_type']
Out[33]:
0       0
1       0
2       0
3       0
4       0
       ..
5105    0
5106    0
5107    1
5108    1
5109    0
Name: Residence_type, Length: 6962, dtype: int64

In [34]:
# Encoding smoking_status

df["smoking_status"].value_counts().index
Out[34]:
Index(['never smoked', 'Unknown', 'formerly smoked', 'smokes'], dtype='object', name='smoking_status')
In [35]:
smoking_status = {"never smoked": 0, "Unknown": 2, "formerly smoked": 0.5, "smokes": 1}  # note: "Unknown" (2) sits outside the never (0) -> smokes (1) scale

df["smoking_status"] = df["smoking_status"].map(smoking_status)

df["smoking_status"]
Out[35]:
0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
       ... 
5105    0.0
5106    0.0
5107    0.0
5108    0.5
5109    2.0
Name: smoking_status, Length: 6962, dtype: float64
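`Series.map` silently returns NaN for any category missing from the mapping dict, so it is worth checking coverage before encoding. A sketch using the smoking-status categories from above (the coverage check itself is a suggested addition, not part of the notebook):

```python
import pandas as pd

smoking = pd.Series(["never smoked", "Unknown", "formerly smoked", "smokes", "never smoked"])
smoking_map = {"never smoked": 0, "Unknown": 2, "formerly smoked": 0.5, "smokes": 1}

# Verify every observed category has an entry before mapping
unmapped = set(smoking.unique()) - set(smoking_map)
encoded = smoking.map(smoking_map)
```

If `unmapped` is non-empty, the mapped column would quietly gain new missing values after the imputation step has already run.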

In [36]:
# triage
df["triage"].value_counts().index
Out[36]:
Index(['yellow', 'green', 'orange', 'red'], dtype='object', name='triage')
In [37]:
# triage_map = {"yellow": 0, "orange": 1, "green": 2, "red": 3}

# df['triage'] = df['triage'].map(triage_map)

# df['triage']


Standardization¶

In [88]:
X = df.drop("triage", axis=1)
y = df["triage"]
In [89]:
from sklearn.preprocessing import StandardScaler

scaler= StandardScaler()

X[X.columns] = scaler.fit_transform(X)
X
Out[89]:
age gender chest pain type blood pressure cholesterol max heart rate exercise angina plasma glucose skin_thickness insulin bmi diabetes_pedigree hypertension heart_disease Residence_type smoking_status
0 -1.465884 0.938135 1.173314 1.410374 3.129483 0.549734 -0.256573 0.485469 -0.603531 -1.109043 -1.256305 0.033263 -0.277565 -0.202792 -0.751562 -0.771064
1 -0.709841 -1.065945 1.970952 2.339167 -0.086923 -0.485357 -0.256573 -0.870489 -0.428764 -1.247255 -1.463029 0.033263 -0.277565 -0.202792 -0.751562 -0.771064
2 -1.717898 0.938135 1.173314 0.945977 3.129483 -0.485357 -0.256573 0.123639 -0.166614 -0.459755 -0.521504 0.033263 -0.277565 -0.202792 -0.751562 -0.771064
3 -0.793846 -1.065945 2.768591 1.317494 1.046546 -0.485357 3.897525 -1.021924 -0.253998 0.458231 -1.463029 0.033263 -0.277565 -0.202792 -0.751562 -0.771064
4 -0.289817 0.938135 1.970952 1.874771 0.437322 -0.485357 -0.256573 0.485469 1.449976 -1.756125 -0.872181 0.033263 -0.277565 -0.202792 -0.751562 -0.771064
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5105 1.894305 -1.065945 -0.421962 0.063623 -1.150619 0.161575 -0.256573 -0.460738 -1.127831 -0.099801 -1.337727 0.033263 3.602766 -0.202792 -0.751562 -0.771064
5106 1.978310 -1.065945 -0.421962 0.620899 -0.981776 -0.226584 -0.256573 1.036404 -1.477364 -1.317504 1.636766 0.033263 -0.277565 -0.202792 -0.751562 -0.771064
5107 1.978310 -1.065945 -0.421962 0.806658 0.092503 -1.455754 -0.256573 -0.494609 -0.690914 -0.907201 0.587230 0.033263 -0.277565 -0.202792 1.330562 -0.771064
5108 -0.541832 0.938135 -0.421962 0.620899 -0.817154 -0.097198 -0.256573 2.096238 -0.996756 -1.041048 -0.106963 0.033263 -0.277565 -0.202792 1.330562 -0.149607
5109 -1.129865 -1.065945 -0.421962 0.713778 -0.234070 0.549734 -0.256573 -0.393462 0.008152 0.185336 -0.017066 0.033263 -0.277565 -0.202792 -0.751562 1.714764

6962 rows × 16 columns
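StandardScaler performs a per-column z-score: subtract the column mean and divide by the population standard deviation (ddof=0). A numpy sketch of the same transform on synthetic data, verifying the result has mean ≈ 0 and std ≈ 1:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=[50, 200], scale=[10, 40], size=(500, 2))

# Same transform StandardScaler applies: center each column, divide by its std (ddof=0)
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
```

This step matters for distance-based clustering: without it, wide-range features such as cholesterol would dominate the Euclidean distances that K-Means and Ward linkage rely on.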

In [90]:
y
Out[90]:
0       yellow
1       orange
2       yellow
3       orange
4       yellow
         ...  
5105    yellow
5106    yellow
5107    yellow
5108     green
5109    yellow
Name: triage, Length: 6962, dtype: object
In [41]:
df.to_csv('cleaned_patient_dataset.csv', index=False)

Correlation between all the attributes¶

In [91]:
plt.figure(figsize=(15, 7), layout="constrained")
sns.heatmap(data=X.corr(), annot=True, cmap='Blues')
plt.show()



Model Training¶

K-Means Clustering¶

In [44]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias = []
silhouette_scores = []

for k in range(2, 30):
    km = KMeans(n_clusters=k, init="k-means++")
    result = km.fit(X)
    inertias.append(result.inertia_)
    silhouette_scores.append(silhouette_score(X, result.labels_))
In [45]:
plt.figure(figsize=(12, 8))
plt.subplot(2, 1, 1)
plt.plot(range(2, 30), inertias, "--")
plt.ylabel("Inertia")

plt.subplot(2, 1, 2)
plt.plot(range(2, 30), silhouette_scores, "--")
plt.xlabel("Value of K")
plt.ylabel("Silhouette Score")
plt.show()
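Beyond reading the elbow and silhouette plots by eye, the silhouette curve can be reduced to a single choice by taking the k with the highest score. A sketch with illustrative score values (not the notebook's actual run):

```python
import numpy as np

k_values = list(range(2, 30))
# Hypothetical silhouette scores, one per k (illustrative values only)
scores = [0.12, 0.11, 0.13, 0.14, 0.15, 0.16, 0.15, 0.14, 0.16, 0.17] + [0.10] * 18
best_k = k_values[int(np.argmax(scores))]
```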
In [92]:
optimal_k=11
In [93]:
kmeans = KMeans(n_clusters=optimal_k, init="k-means++",random_state=42)
kmeans.fit(X)  # unsupervised: the triage labels are not passed to fit
Out[93]:
KMeans(n_clusters=11, random_state=42)
In [ ]:
# 2D Visualization using TSNE

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=200, n_iter=300)
components_tsne = tsne.fit_transform(X)
In [104]:
Kmeans_data = np.vstack((components_tsne.T, kmeans.labels_)).T
Kmeans_data
Out[104]:
array([[ 6.60471678, -0.77885401,  3.        ],
       [ 5.52436733, -1.86486256,  3.        ],
       [ 6.69082737, -0.77648717,  3.        ],
       ...,
       [-4.02759886, -3.83110499,  1.        ],
       [-3.54180551,  2.8170104 ,  1.        ],
       [ 0.67043436, -2.31422114, 10.        ]])
In [97]:
Kmeans_tsne = pd.DataFrame(Kmeans_data, columns=["X1", "X2", "clusters"])
Kmeans_tsne.head(10)
Out[97]:
X1 X2 clusters
0 6.604717 -0.778854 3.0
1 5.524367 -1.864863 3.0
2 6.690827 -0.776487 3.0
3 9.745673 -0.809553 2.0
4 5.912214 0.667596 3.0
5 6.770416 -0.724922 3.0
6 5.942784 -2.004708 3.0
7 6.027802 0.820241 3.0
8 9.723577 -0.259638 2.0
9 6.455491 -2.123156 3.0
In [51]:
plt.figure(figsize=(15, 8))
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Kmeans_tsne, palette="tab10")
plt.title("KMeans++ Clustering")

plt.show()

Gaussian Mixture Model¶

In [131]:
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=11, covariance_type="full")
gmm_model=gmm.fit(X)
In [132]:
GMM_labels = gmm_model.predict(X)
GMM_labels
Out[132]:
array([8, 8, 8, ..., 7, 7, 6], dtype=int64)
In [133]:
from sklearn.metrics import silhouette_score

silhouette_score(X, GMM_labels)
Out[133]:
0.11817354634449294
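For a GMM, the number of components is commonly chosen by minimizing BIC rather than the silhouette score. A sketch on synthetic two-blob data (illustrative; the notebook fixes n_components=11 to match the other models):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs
data = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(8, 1, (150, 2))])

# Fit GMMs with 1..4 components and keep the BIC of each
bics = []
for n in range(1, 5):
    gm = GaussianMixture(n_components=n, covariance_type="full", random_state=0).fit(data)
    bics.append(gm.bic(data))
best_n = 1 + int(np.argmin(bics))
```

BIC trades likelihood against model complexity, so it tends to penalize spurious extra components that a pure likelihood comparison would reward.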
In [134]:
GMM_data = np.vstack((components_tsne.T, GMM_labels)).T
GMM_data
Out[134]:
array([[ 6.60471678, -0.77885401,  8.        ],
       [ 5.52436733, -1.86486256,  8.        ],
       [ 6.69082737, -0.77648717,  8.        ],
       ...,
       [-4.02759886, -3.83110499,  7.        ],
       [-3.54180551,  2.8170104 ,  7.        ],
       [ 0.67043436, -2.31422114,  6.        ]])
In [135]:
GMM_tsne = pd.DataFrame(GMM_data, columns=["X1", "X2", "clusters"])
GMM_tsne.head(10)
Out[135]:
X1 X2 clusters
0 6.604717 -0.778854 8.0
1 5.524367 -1.864863 8.0
2 6.690827 -0.776487 8.0
3 9.745673 -0.809553 0.0
4 5.912214 0.667596 8.0
5 6.770416 -0.724922 8.0
6 5.942784 -2.004708 8.0
7 6.027802 0.820241 8.0
8 9.723577 -0.259638 0.0
9 6.455491 -2.123156 8.0
In [136]:
plt.figure(figsize=(15, 8))
sns.scatterplot(x="X1", y="X2", hue="clusters", data=GMM_tsne, palette="tab10")
plt.title('GMM Clusters')
plt.show()

Hierarchical Clustering¶

In [57]:
from scipy.cluster import hierarchy

Z=hierarchy.linkage(X, method='ward')
In [58]:
Z.shape, Z
Out[58]:
((6961, 4),
 array([[6.22000000e+02, 1.05000000e+03, 3.65905944e-01, 2.00000000e+00],
        [1.36400000e+03, 1.36500000e+03, 3.89680688e-01, 2.00000000e+00],
        [6.41000000e+02, 8.96000000e+02, 4.38727642e-01, 2.00000000e+00],
        ...,
        [1.38980000e+04, 1.39090000e+04, 1.09549943e+02, 7.09000000e+02],
        [1.39190000e+04, 1.39200000e+04, 1.23591245e+02, 5.83200000e+03],
        [1.39180000e+04, 1.39210000e+04, 1.85486960e+02, 6.96200000e+03]]))
In [59]:
plt.figure(figsize=(12, 10))
hierarchy.dendrogram(Z)
plt.title('Dendrogram of Clusters')
plt.ylabel('Euclidean Distance')
plt.show()
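Flat cluster labels can also be read straight from the linkage matrix with scipy's `fcluster`, an alternative to refitting with AgglomerativeClustering below. A sketch on small synthetic data (`Z_demo` stands in for the notebook's `Z`):

```python
import numpy as np
from scipy.cluster import hierarchy

rng = np.random.default_rng(1)
# Three well-separated synthetic groups; the notebook's Z comes from ward linkage on X
pts = np.vstack([rng.normal(0, 0.3, (20, 2)),
                 rng.normal(5, 0.3, (20, 2)),
                 rng.normal(10, 0.3, (20, 2))])
Z_demo = hierarchy.linkage(pts, method="ward")

# Cut the tree into a fixed number of flat clusters
labels = hierarchy.fcluster(Z_demo, t=3, criterion="maxclust")
```

Because the dendrogram is already built, cutting it at different `t` values is essentially free, unlike refitting a model per candidate k.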
In [137]:
optimal_k=11
In [138]:
from sklearn.cluster import AgglomerativeClustering

agg_cluster = AgglomerativeClustering(
    n_clusters=optimal_k, metric="euclidean", linkage="ward"
)
agg_labels = agg_cluster.fit_predict(X)
In [139]:
print(f"Silhouette Score: {silhouette_score(X, agg_labels)}")
Silhouette Score: 0.14278899529227268
In [140]:
np.unique(agg_labels), agg_labels
Out[140]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10], dtype=int64),
 array([ 4,  4,  4, ...,  3, 10,  1], dtype=int64))
In [141]:
Hierarchy_data = np.vstack((components_tsne.T, agg_labels)).T
Hierarchy_data
Out[141]:
array([[ 6.60471678, -0.77885401,  4.        ],
       [ 5.52436733, -1.86486256,  4.        ],
       [ 6.69082737, -0.77648717,  4.        ],
       ...,
       [-4.02759886, -3.83110499,  3.        ],
       [-3.54180551,  2.8170104 , 10.        ],
       [ 0.67043436, -2.31422114,  1.        ]])
In [142]:
hierarchy_tsne = pd.DataFrame(Hierarchy_data, columns=["X1", "X2", "clusters"])
hierarchy_tsne.head(10)
Out[142]:
X1 X2 clusters
0 6.604717 -0.778854 4.0
1 5.524367 -1.864863 4.0
2 6.690827 -0.776487 4.0
3 9.745673 -0.809553 5.0
4 5.912214 0.667596 4.0
5 6.770416 -0.724922 4.0
6 5.942784 -2.004708 4.0
7 6.027802 0.820241 4.0
8 9.723577 -0.259638 5.0
9 6.455491 -2.123156 4.0
In [143]:
plt.figure(figsize=(15, 8))
sns.scatterplot(x="X1", y="X2", hue="clusters", data=hierarchy_tsne, palette="tab10")
plt.title('Hierarchical Clusters')
plt.show()

Compare the clustering results of all the algorithms using Inertia and the Silhouette Score.¶

In [145]:
print(f"Hierarchical Clustering Silhouette Score: {silhouette_score(X, hierarchy_tsne['clusters'])}")
Hierarchical Clustering Silhouette Score: 0.14278899529227268
In [146]:
print(f"GMM Clustering Silhouette Score: {silhouette_score(X, GMM_tsne['clusters'])}")
GMM Clustering Silhouette Score: 0.11817354634449294
In [147]:
print(f"KMeans++ Silhouette Score: {silhouette_score(X, Kmeans_tsne['clusters'])}")
KMeans++ Silhouette Score: 0.15781375239444626

The silhouette scores of all three models are modest

  • Each score is well below 1, indicating only weak cluster separation in this feature space.
 By silhouette score, K-Means++ performs best (≈0.158), followed by Hierarchical Clustering (≈0.143) and the Gaussian Mixture Model (≈0.118). 
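The three reported scores, gathered into one table for side-by-side comparison (values rounded from the cells above):

```python
import pandas as pd

scores = pd.DataFrame(
    {"silhouette": [0.1578, 0.1428, 0.1182]},
    index=["KMeans++", "Hierarchical", "GMM"],
).sort_values("silhouette", ascending=False)
```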

Visualize the clusters formed using T-SNE for all the three algorithms.¶

In [148]:
plt.figure(figsize=(20, 8))

plt.subplot(1,3,1)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=hierarchy_tsne, palette="tab10")
plt.title("Hierarchical Clusters")

plt.subplot(1, 3, 2)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=GMM_tsne, palette="tab10")
plt.title("GMM Clusters")

plt.subplot(1, 3, 3)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Kmeans_tsne, palette="tab10")
plt.title("KMeans++ Clusters")
plt.show()

Expected Insights¶

Identification of distinct patient groups based on health and lifestyle attributes.¶

In [ ]:
Orignal_data = np.vstack((components_tsne.T, y)).T
Orignal_tsne = pd.DataFrame(Orignal_data, columns=["X1", "X2", "clusters"])
Orignal_data
Out[ ]:
array([[6.604716777801514, -0.7788540124893188, 'yellow'],
       [5.524367332458496, -1.8648625612258911, 'orange'],
       [6.690827369689941, -0.7764871716499329, 'yellow'],
       ...,
       [-4.027598857879639, -3.8311049938201904, 'yellow'],
       [-3.5418055057525635, 2.8170104026794434, 'green'],
       [0.6704343557357788, -2.314221143722534, 'yellow']], dtype=object)
In [ ]:
plt.figure(figsize=(20, 15))

plt.subplot(2, 1, 1)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Orignal_tsne, palette="tab10")
plt.title("Original Clusters")


plt.subplot(2, 1, 2)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=GMM_tsne, palette="tab10")
plt.title("GMM Clusters")

plt.show()

Comparison of clustering algorithms to determine which provides the most meaningful segmentation.¶

In [159]:
plt.figure(figsize=(20, 15))

plt.subplot(2, 2, 1)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Orignal_tsne, palette="tab10")
plt.title("Original Clusters")


plt.subplot(2, 2, 2)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=hierarchy_tsne, palette="tab10")
plt.title("Hierarchical Clusters")

plt.subplot(2, 2, 3)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=GMM_tsne, palette="tab10")
plt.title("GMM Clusters")

plt.subplot(2, 2, 4)
sns.scatterplot(x="X1", y="X2", hue="clusters", data=Kmeans_tsne, palette="tab10")
plt.title("KMeans++ Clusters")
plt.show()
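A quantitative complement to the visual comparison is a cross-tabulation of cluster labels against the triage labels. A sketch on toy labels, since the real inputs would be e.g. `kmeans.labels_` and `df['triage']`:

```python
import pandas as pd

# Toy labels standing in for kmeans.labels_ and the triage column
clusters = pd.Series([0, 0, 1, 1, 2, 2, 0, 1])
triage = pd.Series(["yellow", "yellow", "orange", "orange", "red", "green", "yellow", "orange"])

alignment = pd.crosstab(clusters, triage)
```

A cluster whose row mass concentrates in one triage colour corresponds to a patient group the unsupervised model has recovered; rows spread evenly across colours indicate clusters driven by other structure in the features.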
In [ ]: